-------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Challenging Encounters and Within-Physician Practice Variability
DATA README
* Authors: Gabriel Chodick, Yoav Goldstein, Ity Shurtz, Dan Zeltzer
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

* Data overview: 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
The data used for this research are from Maccabi Health Services, Israel. 
These data contain individual administrative data (health insurance and electronic health records), 
so they are both proprietary and contain confidential patient information. 
To protect the privacy of patients, we do not post the data online. 
However, interested researchers can apply for data access for replication-related purposes, 
under the appropriate arrangements for protecting patient confidentiality and research ethics. 
Inquiries about the data can be directed to Ity Shurtz, at shurtz@bgu.ac.il

* Software Requirements
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
The code was written in R and executed using R version 4.1.3 with the following packages:
dplyr, haven, readxl, lmtest, lubridate, miceadds, plm, xgboost, xtable, stargazer
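
Before running the scripts, it may help to confirm that the packages listed above are installed. 
The following is a minimal sketch and is not part of the replication code; the helper name 
missing_packages is hypothetical.

```r
# Packages listed in this README.
required_packages <- c("dplyr", "haven", "readxl", "lmtest", "lubridate",
                       "miceadds", "plm", "xgboost", "xtable", "stargazer")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

missing <- missing_packages(required_packages)
if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```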

* Code Overview
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
This README provides the details needed for data building, analysis, and replication of the exhibits.

Raw data are ingested as csv, xlsx, and Rdata files (see the codebooks below) and stored in a subfolder "./raw_date" 
(all paths are relative to ".", the project directory).

To reproduce the exhibits from raw data, the following R scripts should be executed in the listed order,
in the project directory:
1. Events.R
2. Data.R
3. PS.R
4. Analysis.R
Details of the scripts are provided below.
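
The execution order above can be sketched as a small driver. This is an illustrative wrapper, 
not a file shipped with the code: the function name run_pipeline and its project_dir argument 
are assumptions, and the sketch simply sources each script in the current R session.

```r
# Scripts in the order given in this README.
pipeline_scripts <- c("Events.R", "Data.R", "PS.R", "Analysis.R")

# Hypothetical helper: source each script in order from `project_dir`,
# stopping before anything runs if a script file is missing.
run_pipeline <- function(scripts, project_dir = ".") {
  paths <- file.path(project_dir, scripts)
  absent <- scripts[!file.exists(paths)]
  if (length(absent) > 0) {
    stop("Script(s) not found: ", paste(absent, collapse = ", "))
  }
  # Run from the project directory so relative paths in the scripts resolve.
  old_wd <- setwd(project_dir)
  on.exit(setwd(old_wd), add = TRUE)
  for (script in scripts) {
    message("Running ", script, " ...")
    started <- Sys.time()
    source(script)
    elapsed <- difftime(Sys.time(), started, units = "mins")
    message(script, " finished in ", round(as.numeric(elapsed), 1), " minutes")
  }
  invisible(scripts)
}

# Example call (commented out because the full run takes many hours):
# run_pipeline(pipeline_scripts, project_dir = ".")
```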

* Approximate runtime
-- Analysis only: 2 hours
-- Data building + analysis: 32 hours
- All scripts are provided. Below is a description of each script. 
Inquiries about the code can be directed to Yoav Goldstein at yoavg2@mail.tau.ac.il

* Scripts description 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
1. Events.R: 
- Purpose: sample difficult cases.
- Inputs: Diagnosis.Rdata; Cancer.csv; Visits.Rdata; Patients.Rdata
- Output: Events.Rdata
- Approximate runtime: 10 minutes

2. Data.R: 
- Purpose: build the study database, excluding the propensity score of testing
- Inputs: Visits.Rdata; Events.Rdata; Lab_Referrals.Rdata; Referrals.Rdata; Prescriptions.Rdata; Patients.Rdata; Doctor_Characteristics.Rdata; CancerStatus.xlsx
- Output: data_final.Rdata
- Approximate runtime: 10 hours

3. PS.R: 
- Purpose: estimate the propensity score of testing and predict the score for the study sample
- Inputs: Visits.Rdata; Events.Rdata; Lab_Referrals.Rdata; Referrals.Rdata; Prescriptions.Rdata; Patients.Rdata; Doctor_Characteristics.Rdata; CancerStatus.xlsx
- Output: data_final_with_prediction.Rdata
- Approximate runtime: 10 hours (without hyper-tuning the xgboost model's parameters)

4. Analysis.R: 
- Purpose: produce the paper exhibits
- Input: data_final_with_prediction.Rdata
- Outputs: Figures and Tables
- Approximate runtime: 2 hours
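
Because each script depends on files produced earlier or on raw data, a quick check that the 
inputs are present can save a long failed run. The sketch below copies the input lists from 
the descriptions above; the helper name missing_inputs and the dirs argument (the folders to 
search) are assumptions, since the exact location of each file depends on the local setup.

```r
# Inputs shared by Data.R and PS.R, as listed in this README.
build_inputs <- c("Visits.Rdata", "Events.Rdata", "Lab_Referrals.Rdata",
                  "Referrals.Rdata", "Prescriptions.Rdata", "Patients.Rdata",
                  "Doctor_Characteristics.Rdata", "CancerStatus.xlsx")

# Inputs of each script, copied from the script descriptions.
script_inputs <- list(
  "Events.R"   = c("Diagnosis.Rdata", "Cancer.csv", "Visits.Rdata", "Patients.Rdata"),
  "Data.R"     = build_inputs,
  "PS.R"       = build_inputs,
  "Analysis.R" = "data_final_with_prediction.Rdata"
)

# Hypothetical helper: return the inputs of `script` not found in any of `dirs`.
missing_inputs <- function(script, inputs = script_inputs, dirs = ".") {
  files <- inputs[[script]]
  if (is.null(files)) stop("Unknown script: ", script)
  found <- vapply(files,
                  function(f) any(file.exists(file.path(dirs, f))),
                  logical(1))
  unname(files[!found])
}

# Example: search both the project directory and the raw-data subfolder.
# missing_inputs("Events.R", dirs = c(".", "./raw_date"))
```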


* Project directory structure
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
"." the root directory on the server, including the raw data
"./project" the paper's directory, including the code, research data, and outputs

* Raw data files 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
The data files include patient medical records. Below is a short description of each file and 
the unit of observation (in parentheses)

Patients.Rdata: patient information (patient)
Visits.Rdata: primary care encounters (visit)
Diagnosis.Rdata: patient diagnoses (a dated diagnosis)
Lab_referrals.Rdata: referrals to labs and imaging (referral event)
Referrals.Rdata: referrals to other providers (referral event)
Prescriptions.Rdata: prescriptions (prescription event)
Doctor_Characteristics.Rdata: physician characteristics (physician)
Cancer.csv: cancer registry data (cancer case)
CancerStatus.xlsx: cancer registry death records (cancer case)
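
The raw files come in three formats (Rdata, csv, xlsx). As a rough illustration of how each 
can be read in R, the sketch below dispatches on the file extension. The helper name 
read_raw_file is hypothetical, and the names of the objects stored inside each Rdata file 
are not documented here, so the Rdata branch returns whatever objects the file contains as 
a named list.

```r
# Hypothetical helper: read one raw file according to its extension.
# Rdata files yield a named list of the stored objects;
# csv and xlsx files yield a single table.
read_raw_file <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "rdata") {
    # load() restores objects under their saved names; collect them in a list.
    env <- new.env()
    object_names <- load(path, envir = env)
    mget(object_names, envir = env)
  } else if (ext == "csv") {
    read.csv(path, stringsAsFactors = FALSE)
  } else if (ext == "xlsx") {
    # Requires the readxl package listed under Software Requirements.
    readxl::read_excel(path)
  } else {
    stop("Unsupported file type: ", path)
  }
}

# Example calls (paths depend on where the raw data are stored):
# visits_objects <- read_raw_file("Visits.Rdata")
# cancer <- read_raw_file("Cancer.csv")
```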

* Codebooks (lists of variables in each file)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Patients.Rdata
----------------
patient_id	Patient identifier
Insurance	Insurance name
insurance_type	Insurance Type
insurance_from_date	Insurance starting date
insurance_changed	Insurance changed during the period
district_code	Patient district code
doctor_id	Doctor identifier
facility_id	Clinic identifier
branch_id	Branch identifier
status_in_cancer	Cancer status
d_cancer	Date of cancer registry
d_to_cancer	Date of cancer registry (recovery)
cancer_reg_cd	Cancer Code
ses_ags	Socio Economic Status (from the National Social Security)
ses_level	Socio Economic Status (from a private company)
Status	Patient status
d_cardio	Date from cardio
d_cardihd	Date from cardihd
d_cardiomain	Date from cardio main deas
d_cardipir	Date from cardio Pirpur
d_cardmi	Date from cardmi
d_cva	Date from CVA
d_tia	Date from TIA
d_cvd	Date from CVD
d_pvd	Date from PVD
d_chf	Date from CHF
d_trans	Date from Transplant
d_valve	Date from Valve
d_daibetic	Date from Diabetic
d_dializa	Date from Dializa
d_fertility	Date from Fertility
d_gushe	Date from Gushe
d_hashmana	Date from Hashmana
d_ckd	Date from CKD
d_copd	Date from COPD
d_b_pressure	Date from Blood Pressure
d_o_mishkal	Date from Odef Mishkal
d_t_bait	Date from Tipul Bait
d_osteo	Date from Osteoporosis
d_diabetic_risk	Date from Pre Diabetic High Risk

Visits.Rdata
----------------
Date	Visit date
Time	Visit time
Month	Visit month
patient_id	Patient identifier
doctor_id	Physician identifier
specialization_code	Specialization code
visit_length	Visit duration (in minutes)
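
The codebooks can also serve as a check on a loaded table. The sketch below compares the 
column names of a data frame with the Visits.Rdata codebook above; the helper name 
check_columns is hypothetical, and the empty data frame in the example is a placeholder, 
not real data.

```r
# Column names from the Visits.Rdata codebook.
visits_columns <- c("Date", "Time", "Month", "patient_id", "doctor_id",
                    "specialization_code", "visit_length")

# Hypothetical helper: report codebook columns absent from `df`
# and columns of `df` that the codebook does not list.
check_columns <- function(df, expected) {
  list(missing = setdiff(expected, names(df)),
       extra   = setdiff(names(df), expected))
}

# Placeholder table with no rows, used only to show the call.
placeholder <- data.frame(Date = character(0), patient_id = integer(0))
result <- check_columns(placeholder, visits_columns)
message("Missing columns: ", paste(result$missing, collapse = ", "))
```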

Diagnosis.Rdata
----------------
Date	Diagnosis date
Month	Diagnosis month
patient_id	Patient identifier
diag_doctor_id 	Physician identifier
facility_id	Clinic identifier
maccabi_code	Maccabi diagnosis recognition code
ICD9	International Classification of Diseases 9
diagnosis_desc	Diagnosis name
provider_code	*unused*
facility_desc	Clinic name

Lab_referrals.Rdata and
Referrals.Rdata
----------------
Date	Referral date
Month	Referral month
patient_id	Patient identifier
doctor_id 	Physician identifier
facility_id	Clinic identifier
cl_role_code	*unused*
Form_row	*unused*
form_type	Form type
form_code	Form code
test_code	Lab test code
form_desc	Form name
Test_desc	Lab test name

Prescriptions.Rdata
----------------
Date	Visit date
Month	Visit month
Time	Visit time
patient_id	Patient identifier
doctor_id 	Physician identifier
pres_date	Prescription date
m_pres	Prescription month
largo_code	Drug largo code
drug_desc	Drug ingredient name
drug_code	Drug ingredient code
mirsham_number	*unused*
sifrur_mirsham	*unused*

Doctor_Characteristics.Rdata
----------------
doctor_id	Physician identifier
Facility_id	Clinic identifier
Occupation	Occupation code
area_desc	Area name
doc_gender	Gender
doc_university	University name
study_country	University’s country name
doc_time_from_study	Years from graduation
doc_age	Age
doc_exper	Years in Maccabi
d_immigration	Indicator for immigrant
uni_code	University code

Cancer.csv
------------------
RANDOM_ID	Patient identifier
CANCER_REG	Cancer registration
DETEFROMCANCER	Date of cancer registry
DETETOCANCER	Date of cancer registry (recovery)
STATUS_IN_CANCER	Cancer status

CancerStatus.xlsx
----------------
TZNUM	Patient identifier
DATE_DEATH	Death date
DETEFROMCANCER	Date of cancer registry
DETETOCANCER	Date of cancer registry (recovery)
STATUS_IN_CANCER	Cancer status